Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Authors

  • Sébastien Bubeck
  • Nicolò Cesa-Bianchi
Abstract

Multi-armed bandit problems are the most basic examples of sequential decision problems with an exploration–exploitation trade-off. This is the balance between staying with the option that gave the highest payoffs in the past and exploring new options that might give higher payoffs in the future. Although the study of bandit problems dates back to the 1930s, exploration–exploitation trade-offs arise in several modern applications, such as ad placement, website optimization, and packet routing. Mathematically, a multi-armed bandit is defined by the payoff process associated with each option. In this monograph, we focus on two extreme cases in which the analysis of regret is particularly simple and elegant: i.i.d. payoffs and adversarial payoffs. Besides the basic setting of finitely many actions, we also analyze some of the most important variants and extensions, such as the contextual bandit model.
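
As a rough illustration of regret in the i.i.d. setting, the sketch below simulates Bernoulli arms and accumulates the pseudo-regret, that is, the gap between the best arm's mean and the mean of the arm actually played, summed over rounds. The epsilon-greedy rule is used only because it is short; it is an assumption of this example and is not presented as the monograph's method. The function name, the arm means, and the parameters are all hypothetical.

```python
import random


def simulate_epsilon_greedy(means, horizon, epsilon=0.1, seed=0):
    """Play Bernoulli arms with epsilon-greedy and return (pseudo_regret, pull_counts).

    `means` are the true arm means, known to the simulator but not to the player.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of pulls per arm
    sums = [0.0] * k      # total observed reward per arm
    best_mean = max(means)
    pseudo_regret = 0.0

    for t in range(horizon):
        if t < k:
            arm = t  # play every arm once so each empirical mean is defined
        elif rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: uniformly random arm
        else:
            # exploit: arm with the highest empirical mean so far
            arm = max(range(k), key=lambda i: sums[i] / counts[i])

        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        # expected loss of this round relative to always playing the best arm
        pseudo_regret += best_mean - means[arm]

    return pseudo_regret, counts


regret, pulls = simulate_epsilon_greedy([0.3, 0.5, 0.7], horizon=1000)
print("pseudo-regret:", regret)
print("pulls per arm:", pulls)
```

With a fixed exploration rate the player keeps sampling suboptimal arms at a constant frequency, so the pseudo-regret of this particular rule grows linearly in the horizon; it is meant only to make the quantity concrete.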

Similar resources

Analysis of Thompson Sampling for the Multi-armed Bandit Problem

The multi-armed bandit problem is a popular model for studying the exploration/exploitation trade-off in sequential decision problems. Many algorithms are now available for this well-studied problem. One of the earliest algorithms, given by W. R. Thompson, dates back to 1933. This algorithm, referred to as Thompson Sampling, is a natural Bayesian algorithm. The basic idea is to choose an arm to pla...

Following the Perturbed Leader to Gamble at Multi-armed Bandits

Following the perturbed leader (FPL) is a powerful technique for solving online decision problems. Kalai and Vempala [1] rediscovered this algorithm recently. A traditional model for online decision problems is the multi-armed bandit. In it a gambler has to choose at each round one of the k levers to pull with the intention to minimize the cumulated cost. There are four versions of the nonstoch...

Bandit Regret Scaling with the Effective Loss Range

We study how the regret guarantees of nonstochastic multi-armed bandits can be improved, if the effective range of the losses in each round is small (e.g. the maximal difference between two losses in a given round). Despite a recent impossibility result, we show how this can be made possible under certain mild additional assumptions, such as availability of rough estimates of the losses, or adv...

Combinatorial Bandits Revisited

This paper investigates stochastic and adversarial combinatorial multi-armed bandit problems. In the stochastic setting under semi-bandit feedback, we derive a problem-specific regret lower bound, and discuss its scaling with the dimension of the decision space. We propose ESCB, an algorithm that efficiently exploits the structure of the problem and provide a finite-time analysis of its regret....

Deviations of Stochastic Bandit Regret

This paper studies the deviations of the regret in a stochastic multi-armed bandit problem. When the total number of plays n is known beforehand by the agent, Audibert et al. (2009) exhibit a policy such that with probability at least 1 − 1/n, the regret of the policy is of order log n. They have also shown that such a property is not shared by the popular UCB1 policy of Auer et al. (2002). This ...

Journal:
  • Foundations and Trends in Machine Learning

Volume 5  Issue

Pages  -

Publication date 2012